PCA (Principal Component Analysis) is a widely used technique for dimensionality reduction. The time complexity of PCA depends on the number of data-points n and the number of features p. In the worst-case scenario (based on the definition), the time complexity of PCA is O(p2n + p3), where n is the number of data-points and p is the number of features.
The computation of the covariance matrix contributes
O(p2n) to the overall complexity, while the
eigen-value decomposition adds O(p3). In R, using
the prcomp function with scaling to implement PCA as SVD
(Singular Value Decomposition) of the Dataset matrix results in a time
complexity of O(n2p +
p3).
For reference and more information: - StackOverflow - Complexity of PCA - Medium - Computational Complexity of PCA - SVD Implementation to Speed up PCA
t-SNE (t-distributed Stochastic Neighbor Embedding) is another dimensionality reduction technique. Its time complexity is O(n2), where n is the number of data-points.
For reference and more information: - t-SNE Paper (arXiv)
UMAP (Uniform Manifold Approximation and Projection) is a popular dimensionality reduction method. Its time complexity varies based on the dimension of the target reduced space d and the number of data-points n. In the average case, the complexity is O(d * n1.14), while in the worst-case scenario, it is O(n2).
For reference and more information: - UMAP GitHub Issue - ResearchGate - UMAP Publication
These are the needed parameters for dim-reduce:
k <- 7
seed <- 1234
results_path <- "../results/dim-reduce/R"
# t-SNE Parameters
tsne_perplexity <- 40
tsne_max_iter <- 1000
tsne_initial_dims <- 50
tsne_seed <- seed
# UMAP Parameters
umap_n_neighbors <- 15
umap_metric <- "euclidean"
umap_min_dist <- 0.1
umap_seed <- seed
color <- "variant"
shape <- "year"
include_plots <- TRUE
### OPTIONAL ###
# filter1_factor <- "variant"
# filter1_values <- c("Omicron", "Omicron Sub")
# filter2_factor <- "year"
# filter2_values <- c("2023")
# GET KMERS FROM PRE-WRITTEN FILES (depends on strat_size)
# LOAD SOURCES #############################################
strat_size <- 100
kmers_data_path <- "../data/kmers"
k_path <- sprintf("%s/kmer_%d_%d.csv", kmers_data_path, k, strat_size)
message(sprintf("Reading %s for later... ", k_path), appendLF = FALSE)
## Reading ../data/kmers/kmer_7_100.csv for later...
kmers <- utils::read.csv(k_path)
message("DONE!")
## DONE!
PCA is a popular method for dimensionality reduction that transforms high-dimensional data into a lower-dimensional space. It identifies the principal components, which are linear combinations of the original features that capture the most variance in the data. In this code, PCA is used to reduce the dimensionality of the scaled data (x). The results are plotted in 2D and 3D PCA plots, screeplot, factor loadings plot, and many other more.
pre_reduce_res <- pre_reduce(results_path, kmers, k, filter1_factor,
filter1_values, filter2_factor, filter2_values)
## [1] "Pre-processing and scaling data..."
df <- pre_reduce_res$df # df is the original dataset
x <- pre_reduce_res$x # x is the scaled data
# Perform PCA
pca_df <- pca_fn(x)
## [1] "Performing PCA..."
t-SNE is a nonlinear dimensionality reduction technique that is particularly useful for visualizing high-dimensional data in a lower-dimensional space while preserving local structures. In this code, t-SNE is performed using the tsne or Rtsne library (based on user selection) on the PCA results. The function generates 2D and 3D t-SNE plots, which represent the data points in a way that maintains the similarity between nearby points.
# Perform t-SNE via 'tsne' library using PCA results (in 3 dimensions)
# # Note: Uncomment the next two line to use tsne; otherwise, comment them
tsne_df <- tsne_fn(pca_df$x, 3, tsne_initial_dims, tsne_perplexity,
tsne_max_iter, tsne_seed)
# Perform t-SNE via 'Rtsne' library using PCA results (in 3 dimensions)
# # Note: Uncomment the next two lines to use Rtsne; otherwise, comment them
# is_tsne <- FALSE
# tsne_df <- rtsne_fn(pca_df$x, 3, tsne_perplexity, tsne_max_iter, tsne_seed)
UMAP is another nonlinear dimensionality reduction method that seeks to preserve both local and global structures of the data. It is known for its ability to scale to large datasets. In this code, UMAP is used to project the scaled data (x) into a 3D space. The function generates 2D and 3D UMAP plots, which provide a reduced representation of the data while preserving the relationships between data points.
# Perform UMAP (in 3 dimensions)
umap_df <- umap_fn(x, 3, umap_n_neighbors, umap_metric,
umap_min_dist, umap_seed)
# Generate 2D PCA plot
pcolor <- df[[color]]
pshape <- df[[shape]]
p <- autoplot(pca_df, data = df) +
geom_point(aes(color = pcolor, shape = pshape, text = paste(
"Identifier: ", df$gisaid_epi_isl, "\n",
"Variant: ", df$variant, "\n",
"Sex: ", df$sex, "\n",
"Division Exposure: ", df$division_exposure, "\n",
"Year: ", format(as.Date(df$date), "%Y"), "\n",
"Strain: ", df$strain, "\n",
"Pangolin Lineage: ", df$pangolin_lineage
))) +
scale_color_brewer(palette = "Set1")
## Warning in geom_point(aes(color = pcolor, shape = pshape, text =
## paste("Identifier: ", : Ignoring unknown aesthetics: text
p <- ggplotly(p)
p
# Generate 3D PCA plot
p <- plot_ly(as.data.frame(pca_df$x[, 1:3]),
x = ~PC1, y = ~PC2, z = ~PC3, type = "scatter3d",
mode = "markers", color = df[[color]], symbol = df[[shape]],
text = paste(
"Identifier: ", df$gisaid_epi_isl, "\n",
"Variant: ", df$variant, "\n",
"Sex: ", df$sex, "\n",
"Division Exposure: ", df$division_exposure, "\n",
"Year: ", format(as.Date(df$date), "%Y"), "\n",
"Strain: ", df$strain, "\n",
"Pangolin Lineage: ", df$pangolin_lineage)
)
p <- ggplotly(p)
p
# Generate screeplot
p <- fviz_eig(pca_df,
xlab = "Number of Principal Components")
p <- ggplotly(p)
p
## Warning: 'bar' objects don't have these attributes: 'mode'
## Valid attributes include:
## '_deprecated', 'alignmentgroup', 'base', 'basesrc', 'cliponaxis', 'constraintext', 'customdata', 'customdatasrc', 'dx', 'dy', 'error_x', 'error_y', 'hoverinfo', 'hoverinfosrc', 'hoverlabel', 'hovertemplate', 'hovertemplatesrc', 'hovertext', 'hovertextsrc', 'ids', 'idssrc', 'insidetextanchor', 'insidetextfont', 'legendgroup', 'legendgrouptitle', 'legendrank', 'marker', 'meta', 'metasrc', 'name', 'offset', 'offsetgroup', 'offsetsrc', 'opacity', 'orientation', 'outsidetextfont', 'selected', 'selectedpoints', 'showlegend', 'stream', 'text', 'textangle', 'textfont', 'textposition', 'textpositionsrc', 'textsrc', 'texttemplate', 'texttemplatesrc', 'transforms', 'type', 'uid', 'uirevision', 'unselected', 'visible', 'width', 'widthsrc', 'x', 'x0', 'xaxis', 'xcalendar', 'xhoverformat', 'xperiod', 'xperiod0', 'xperiodalignment', 'xsrc', 'y', 'y0', 'yaxis', 'ycalendar', 'yhoverformat', 'yperiod', 'yperiod0', 'yperiodalignment', 'ysrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'
# Generate factor loadings plot of first 3 principal components
n_components <- 3
# Extract factor loadings
loadings <- pca_df$rotation
# Plot bar plots for factor loadings of n principal components
for (i in 1:n_components) {
# Create a data frame for the factor loadings
loadings_df <- data.frame(
variable = colnames(x),
loading = loadings[, i]
)
# Create a bar plot using ggplot2
p <- ggplot(loadings_df, aes(x = variable, y = loading)) +
geom_bar(stat = "identity", fill = "blue") +
labs(
title = paste("Principal Component", i),
x = "Variables", y = "Factor Loadings"
)
p <- ggplotly(p)
p
}
# Generate graph of individuals
p <- fviz_pca_ind(pca_df,
col.ind = "cos2", # Color by the quality of representation
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE, # Avoid text overlapping
xlab = "PC1",
ylab = "PC2")
p <- ggplotly(p)
## Warning in geom2trace.default(dots[[1L]][[1L]], dots[[2L]][[1L]], dots[[3L]][[1L]]): geom_GeomTextRepel() has yet to be implemented in plotly.
## If you'd like to see this geom implemented,
## Please open an issue with your example code at
## https://github.com/ropensci/plotly/issues
p
# Generate graph of variables
p <- fviz_pca_var(pca_df,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE, # Avoid text overlapping
xlab = "PC1",
ylab = "PC2")
p <- ggplotly(p)
## Warning in geom2trace.default(dots[[1L]][[1L]], dots[[2L]][[1L]], dots[[3L]][[1L]]): geom_GeomTextRepel() has yet to be implemented in plotly.
## If you'd like to see this geom implemented,
## Please open an issue with your example code at
## https://github.com/ropensci/plotly/issues
p
# Generate biplot
p <- fviz_pca_biplot(pca_df,
col.var = "#2E9FDF", # Variables color
col.ind = "#696969", # Individuals color
addEllipses = TRUE,
xlab = "PC1",
ylab = "PC2")
p <- ggplotly(p)
p
# Generate 2D t-SNE plot
tsne_df_2d <- data.frame(X1 = tsne_df[, 1], X2 = tsne_df[, 2],
color = df[[color]], shape = df[[shape]])
p <- ggplot(tsne_df_2d, aes(x = X1, y = X2, color = color, shape = shape)) +
geom_point(aes(text = paste(
"Identifier: ", df$gisaid_epi_isl, "\n",
"Variant: ", df$variant, "\n",
"Sex: ", df$sex, "\n",
"Division Exposure: ", df$division_exposure, "\n",
"Year: ", format(as.Date(df$date), "%Y"), "\n",
"Strain: ", df$strain, "\n",
"Pangolin Lineage: ", df$pangolin_lineage
))) +
xlab("TSNE-2D-1") +
ylab("TSNE-2D-2") +
scale_color_brewer(palette = "Set1")
## Warning in geom_point(aes(text = paste("Identifier: ", df$gisaid_epi_isl, :
## Ignoring unknown aesthetics: text
p <- ggplotly(p)
p
tsne_df_3d <- data.frame(X1 = tsne_df[, 1], X2 = tsne_df[, 2], X3 = tsne_df[, 3],
color = df[[color]], shape = df[[shape]])
final <- cbind(data.frame(tsne_df_3d), df[[color]], df[[shape]])
p <- plot_ly(final,
x = ~X1, y = ~X2, z = ~X3, type = "scatter3d", mode = "markers",
color = df[[color]], symbol = ~shape,
text = paste(
"Identifier: ", df$gisaid_epi_isl, "<br>",
"Variant: ", df$variant, "<br>",
"Sex: ", df$sex, "<br>",
"Division Exposure: ", df$division_exposure, "<br>",
"Year: ", format(as.Date(df$date), "%Y"), "<br>",
"Strain: ", df$strain, "<br>",
"Pangolin Lineage: ", df$pangolin_lineage
)
)
p <- ggplotly(p)
p
# Generate 2D UMAP plot
emb <- umap_df$layout
x_o <- emb[, 1]
y_o <- emb[, 2]
p <- ggplot(
data = as.data.frame(emb),
aes(x = x_o, y = y_o, color = df[[color]], shape = df[[shape]])
) +
geom_point(aes(text = paste(
"Identifier: ", df$gisaid_epi_isl, "\n",
"Variant: ", df$variant, "\n",
"Sex: ", df$sex, "\n",
"Division Exposure: ", df$division_exposure, "\n",
"Year: ", format(as.Date(df$date), "%Y"), "\n",
"Strain: ", df$strain, "\n",
"Pangolin Lineage: ", df$pangolin_lineage
))) +
xlab("UMAP_1") +
ylab("UMAP_2") +
labs(shape = "shape", color="color") +
scale_color_brewer(palette = "Set1")
## Warning in geom_point(aes(text = paste("Identifier: ", df$gisaid_epi_isl, :
## Ignoring unknown aesthetics: text
p <- ggplotly(p)
p
# Generate 3D UMAP plot
final <- cbind(data.frame(umap_df[["layout"]]), color, shape)
p <- plot_ly(final,
x = ~X1, y = ~X2, z = ~X3, type = "scatter3d", mode = "markers",
color = df[[color]], symbol = ~shape,
text = paste(
"Identifier: ", df$gisaid_epi_isl, "<br>",
"Variant: ", df$variant, "<br>",
"Sex: ", df$sex, "<br>",
"Division Exposure: ", df$division_exposure, "<br>",
"Year: ", format(as.Date(df$date), "%Y"), "<br>",
"Strain: ", df$strain, "<br>",
"Pangolin Lineage: ", df$pangolin_lineage
)
)
p <- p %>% add_markers()
p <- p %>% layout(scene = list(
xaxis = list(title = "0"),
yaxis = list(title = "1"),
zaxis = list(title = "2")
))
p <- ggplotly(p)
p
## R Session Info:
## ```r
## R version 4.3.1 (2023-06-16 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 22621)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_Philippines.utf8 LC_CTYPE=English_Philippines.utf8
## [3] LC_MONETARY=English_Philippines.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_Philippines.utf8
##
## time zone: Asia/Singapore
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] glue_1.6.2 highcharter_0.9.4 data.table_1.14.8
## [4] microbenchmark_1.4.10 colorspace_2.1-0 cluster_2.1.4
## [7] dendextend_1.17.1 ggdendro_0.1.23 devtools_2.4.5
## [10] usethis_2.2.2 ggfortify_0.4.16 RColorBrewer_1.1-3
## [13] tsne_0.1-3.1 Rtsne_0.16 scales_1.2.1
## [16] factoextra_1.0.7.999 htmlwidgets_1.6.2 umap_0.2.10.0
## [19] seqinr_4.2-30 gsubfn_0.7 proto_1.0.0
## [22] validate_1.1.3 kmer_1.1.2 ape_5.7-1
## [25] forcats_1.0.0 purrr_1.0.1 readr_2.1.4
## [28] tibble_3.2.1 tidyverse_2.0.0 tidyr_1.3.0
## [31] stringr_1.5.0 shiny_1.7.4.1 rmarkdown_2.23
## [34] markdown_1.7 rio_0.5.29 psych_2.3.6
## [37] plotly_4.10.2 lubridate_1.9.2 httr_1.4.6
## [40] ggvis_0.4.8 ggthemes_4.2.4 GGally_2.1.2
## [43] ggplot2_3.4.2 dplyr_1.1.2 plyr_1.8.8
##
## loaded via a namespace (and not attached):
## [1] rstudioapi_0.15.0 jsonlite_1.8.5 magrittr_2.0.3 farver_2.1.1
## [5] fs_1.6.2 vctrs_0.6.3 memoise_2.0.1 askpass_1.1
## [9] rstatix_0.7.2 htmltools_0.5.5 curl_5.0.1 haven_2.5.2
## [13] broom_1.0.5 cellranger_1.1.0 TTR_0.24.3 sass_0.4.6
## [17] bslib_0.5.0 zoo_1.8-12 cachem_1.0.8 igraph_1.5.0.1
## [21] mime_0.12 lifecycle_1.0.3 pkgconfig_2.0.3 Matrix_1.5-4.1
## [25] R6_2.5.1 fastmap_1.1.1 digest_0.6.32 reshape_0.8.9
## [29] ps_1.7.5 RSpectra_0.16-1 pkgload_1.3.2.1 crosstalk_1.2.0
## [33] ggpubr_0.6.0 labeling_0.4.2 fansi_1.0.4 timechange_0.2.0
## [37] abind_1.4-5 compiler_4.3.1 remotes_2.4.2 withr_2.5.0
## [41] backports_1.4.1 carData_3.0-5 viridis_0.6.3 pkgbuild_1.4.2
## [45] ggsignif_0.6.4 MASS_7.3-60 openssl_2.0.6 sessioninfo_1.2.2
## [49] tools_4.3.1 foreign_0.8-84 quantmod_0.4.24 zip_2.3.0
## [53] httpuv_1.6.11 callr_3.7.3 nlme_3.1-162 promises_1.2.0.1
## [57] grid_4.3.1 ade4_1.7-22 generics_0.1.3 gtable_0.3.3
## [61] tzdb_0.4.0 hms_1.1.3 car_3.1-2 utf8_1.2.3
## [65] ggrepel_0.9.3 pillar_1.9.0 later_1.3.1 lattice_0.21-8
## [69] tidyselect_1.2.0 phylogram_2.1.0 miniUI_0.1.1.1 knitr_1.43
## [73] gridExtra_2.3 xfun_0.39 stringi_1.7.12 lazyeval_0.2.2
## [77] yaml_2.3.7 pacman_0.5.1 evaluate_0.21 tcltk_4.3.1
## [81] settings_0.2.7 cli_3.6.1 xtable_1.8-4 reticulate_1.30
## [85] munsell_0.5.0 processx_3.8.1 jquerylib_0.1.4 Rcpp_1.0.10
## [89] readxl_1.4.2 png_0.1-8 parallel_4.3.1 ellipsis_0.3.2
## [93] assertthat_0.2.1 prettyunits_1.1.1 profvis_0.3.8 urlchecker_1.0.1
## [97] viridisLite_0.4.2 rlist_0.4.6.2 xts_0.13.1 openxlsx_4.2.5.2
## [101] crayon_1.5.2 rlang_1.1.1 mnormt_2.1.1
##
## ```
## Session information saved to: R_session_info.txt